Visual Programming


Visual Programming for Step-by-Step Text-to-Image Generation and Evaluation

Neural Information Processing Systems

As large language models have demonstrated impressive performance in many domains, recent works have adopted language models (LMs) as controllers of visual modules for vision-and-language tasks. While existing work focuses on equipping LMs with visual understanding, we propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation. First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. We employ an LM to handle the first two steps (object/count generation and layout generation), by finetuning it on text-layout pairs. Our step-by-step T2I generation framework provides stronger spatial control than end-to-end models, the dominant approach for this task.
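To make the three-step decomposition concrete, here is a minimal Python sketch of a VPGen-style pipeline. The function names, signatures, and the box format are assumptions made for exposition, not VPGen's actual interface.

```python
# Illustrative sketch of a VPGen-style step-by-step T2I pipeline.
# All stage interfaces below are assumptions for exposition, not VPGen's API.

def generate_objects_and_counts(prompt: str) -> list[tuple[str, int]]:
    # Step 1: a finetuned LM would predict objects and counts from the prompt.
    # Hard-coded here to keep the sketch self-contained.
    return [("dog", 2), ("ball", 1)]

def generate_layout(objects: list[tuple[str, int]]) -> list[dict]:
    # Step 2: the LM assigns a bounding box (x, y, w, h in [0, 1]) per instance.
    layout = []
    for name, count in objects:
        for i in range(count):
            layout.append({"object": name, "box": (0.1 + 0.4 * i, 0.3, 0.3, 0.4)})
    return layout

def generate_image(prompt: str, layout: list[dict]):
    # Step 3: a layout-conditioned image generator renders the final image.
    # Stubbed: a real system would call a layout-to-image model here.
    return {"prompt": prompt, "layout": layout}

prompt = "two dogs playing with a ball"
objects = generate_objects_and_counts(prompt)
layout = generate_layout(objects)
print(generate_image(prompt, layout))
```

Because the layout is an explicit intermediate, a user can inspect or edit the predicted boxes before image generation, which is where the stronger spatial control comes from.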


AIAP: A No-Code Workflow Builder for Non-Experts with Natural Language and Multi-Agent Collaboration

An, Hyunjn, Kim, Yongwon, Seo, Wonduk, Park, Joonil, Kang, Daye, Oh, Changhoon, Kim, Dokyun, Lee, Seunghyun

arXiv.org Artificial Intelligence

While many tools are available for designing AI, non-experts still face challenges in clearly expressing their intent and managing system complexity. We introduce AIAP, a no-code platform that integrates natural language input with visual workflows. AIAP leverages a coordinated multi-agent system to decompose ambiguous user instructions into modular, actionable steps, hidden from users behind a unified interface. A user study involving 32 participants showed that AIAP's AI-generated suggestions, modular workflows, and automatic identification of data, actions, and context significantly improved participants' ability to develop services intuitively. These findings highlight that natural language-based visual programming significantly reduces barriers and enhances user experience in AI service design.
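As a hypothetical sketch of the decomposition idea, the snippet below shows a coordinator fanning an ambiguous instruction out to specialist agents for data, actions, and context, then merging their outputs into modular workflow steps. All agent logic and names here are illustrative assumptions, not AIAP's implementation.

```python
# Hypothetical sketch of AIAP-style coordination: a coordinator splits an
# ambiguous instruction across specialist "agents" and merges their outputs
# into modular workflow steps. The agent roles mirror the paper's data /
# action / context categories; everything else is an assumption.

def data_agent(instruction: str) -> list[str]:
    # Identify data references (here: file-like tokens).
    return [w for w in instruction.split() if w.endswith(".csv")]

def action_agent(instruction: str) -> list[str]:
    # Identify actionable verbs.
    verbs = {"summarize", "translate", "classify", "email"}
    return [w for w in instruction.lower().split() if w in verbs]

def context_agent(instruction: str) -> str:
    # Infer task context from cues in the instruction.
    return "weekly report" if "weekly" in instruction.lower() else "ad hoc task"

def coordinator(instruction: str) -> list[dict]:
    # Merge agent outputs into ordered, modular steps the UI can render.
    data = data_agent(instruction)
    context = context_agent(instruction)
    return [{"step": i + 1, "action": a, "inputs": data, "context": context}
            for i, a in enumerate(action_agent(instruction))]

print(coordinator("Summarize sales.csv and email the weekly result"))
```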


Can We Generate Visual Programs Without Prompting LLMs?

Shlapentokh-Rothman, Michal, Wang, Yu-Xiong, Hoiem, Derek

arXiv.org Artificial Intelligence

Visual programming prompts LLMs (large language models) to generate executable code for visual tasks like visual question answering (VQA). Prompt-based methods are difficult to improve while also being unreliable and costly in both time and money. Our goal is to develop an efficient visual programming system that avoids 1) using prompt-based LLMs at inference time and 2) requiring a large set of program and answer annotations. We develop a synthetic data augmentation approach and an alternative program generation method based on decoupling programs into higher-level skills, called templates, and their corresponding arguments. Our results show that with data augmentation, prompt-free smaller LLMs (approximately 1B parameters) are competitive with state-of-the-art models, with the added benefit of much faster inference.
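The sketch below illustrates the template/argument decoupling: one step picks a program template (a reusable skill), a separate step fills in its arguments. The templates and the keyword-based selectors are illustrative assumptions, standing in for the paper's trained models.

```python
# Sketch of template/argument decoupling: select a program template (skill),
# then fill its argument slots. The template set and the heuristic selectors
# are assumptions for this sketch, not the paper's trained components.

TEMPLATES = {
    "count":  "boxes = detect(image, '{obj}')\nanswer = len(boxes)",
    "exists": "boxes = detect(image, '{obj}')\nanswer = len(boxes) > 0",
    "attribute": "box = detect(image, '{obj}')[0]\nanswer = query_attribute(box, '{attr}')",
}

def select_template(question: str) -> str:
    # A small trained model would do this; a keyword heuristic stands in.
    q = question.lower()
    if q.startswith("how many"):
        return "count"
    if q.startswith("is there") or q.startswith("are there"):
        return "exists"
    return "attribute"

def fill_arguments(question: str) -> dict:
    # A trained argument predictor would extract these; a toy rule stands in.
    words = question.rstrip("?").split()
    return {"obj": words[-1], "attr": "color"}

question = "How many dogs are in the picture"
name = select_template(question)
program = TEMPLATES[name].format(**fill_arguments(question))
print(program)
```

Because the template inventory is small and fixed, program generation reduces to classification plus slot filling, which is what makes small prompt-free models viable.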


Text2VP: Generative AI for Visual Programming and Parametric Modeling

Feng, Guangxi, Yan, Wei

arXiv.org Artificial Intelligence

The integration of generative artificial intelligence (AI) into architectural design has evolved significantly, marked by recent advances in AI for generating text, images, and 3D models. However, no models exist for generating the parametric models used in architectural design to produce various design options, including free-form designs, and to optimize those options. This study creates and investigates an innovative application of generative AI in parametric modeling by leveraging a customized Text-to-Visual Programming (Text2VP) GPT derived from GPT-4. The primary focus is on automating the generation of graph-based visual programming workflows, including parameters and the links among them, through AI-generated scripts that accurately reflect users' design intentions and allow users to change parameter values interactively. The Text2VP GPT customization process utilizes detailed and complete documentation of the visual programming language components, example-driven few-shot learning, and specific instructional guides. Our testing demonstrates Text2VP's capability to generate working parametric models. The paper also discusses Text2VP's limitations; for example, more complex parametric model generation introduces higher error rates. This research highlights the potential of generative AI in visual programming and parametric modeling and sets a foundation for future enhancements to handle more sophisticated and intricate modeling tasks effectively. The study aims to allow designers to create and change design models without significant effort spent learning a specific programming language such as Grasshopper.
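For intuition, here is an illustrative data structure for the kind of graph-based parametric workflow such a script might emit: parameter nodes feed components through links, and edits to a parameter propagate downstream. The schema is an assumption made for exposition, not Grasshopper's actual file format or Text2VP's output.

```python
# Illustrative representation of a graph-based parametric workflow:
# parameter nodes are linked into components. The schema is an assumption
# for exposition, not Grasshopper's format or Text2VP's actual output.

workflow = {
    "parameters": {
        "radius": {"type": "slider", "min": 1.0, "max": 10.0, "value": 4.0},
        "height": {"type": "slider", "min": 1.0, "max": 20.0, "value": 8.0},
    },
    "components": {
        "circle":  {"op": "Circle",  "inputs": {"R": "radius"}},
        "extrude": {"op": "Extrude", "inputs": {"base": "circle", "H": "height"}},
    },
}

def set_parameter(wf: dict, name: str, value: float) -> None:
    # Interactive edits update the parameter node (clamped to its range);
    # downstream components re-evaluate via the links.
    p = wf["parameters"][name]
    p["value"] = max(p["min"], min(p["max"], value))

set_parameter(workflow, "radius", 6.5)
print(workflow["parameters"]["radius"]["value"])  # 6.5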


Recursive Visual Programming

Ge, Jiaxin, Subramanian, Sanjay, Shi, Baifeng, Herzig, Roei, Darrell, Trevor

arXiv.org Artificial Intelligence

Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods demonstrate impressive compositional and reasoning capabilities, especially in few-shot and zero-shot scenarios. However, existing VP methods generate all code in a single function, resulting in code that is suboptimal in terms of both accuracy and interpretability. Inspired by human coding practices, we propose Recursive Visual Programming (RVP), which simplifies generated routines, provides more efficient problem solving, and can manage more complex data structures. RVP approaches VQA tasks with an iterative, recursive code generation approach, allowing decomposition of complicated problems into smaller parts. Notably, RVP is capable of dynamic type assignment: as the system recursively generates a new piece of code, it autonomously determines the appropriate return type and crafts the requisite code to produce that output. We show RVP's efficacy through extensive experiments on benchmarks including VSR, COVR, GQA, and NextQA, underscoring the value of adopting human-like recursive and modular programming techniques for solving VQA tasks through coding.
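A minimal sketch of the recursive idea follows: answering a comparison question spawns two counting sub-questions, each answered by a recursive call that decides its own return type. The question routing and the stubbed detector are illustrative assumptions, not RVP's generated code.

```python
# Minimal sketch of RVP-style recursion: a question may decompose into
# sub-questions answered recursively, and each branch decides its own
# return type. The routing rules and stub detector are assumptions.

def detect(image, name: str) -> list:
    # Stub detector; a real system would call a vision model here.
    return [object()] * {"dogs": 2, "cats": 1}.get(name, 0)

def answer(image, question: str):
    q = question.lower().rstrip("?")
    if q.startswith("how many"):
        # Dynamic type assignment: this branch returns an int.
        return len(detect(image, q.split()[-1]))
    if " more " in q and " than " in q:
        # Decompose "more X than Y" into two counting sub-questions,
        # answer each recursively, then combine; this branch returns a bool.
        left, right = q.split(" more ")[1].split(" than ")
        return answer(image, f"how many {left}") > answer(image, f"how many {right}")
    return "unknown"

print(answer(None, "Are there more dogs than cats?"))  # True
```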


VISAR: A Human-AI Argumentative Writing Assistant with Visual Programming and Rapid Draft Prototyping

Zhang, Zheng, Gao, Jie, Dhaliwal, Ranjodh Singh, Li, Toby Jia-Jun

arXiv.org Artificial Intelligence

In argumentative writing, writers must brainstorm hierarchical writing goals, ensure the persuasiveness of their arguments, and revise and organize their plans through drafting. Recent advances in large language models (LLMs) have made interactive text generation through a chat interface (e.g., ChatGPT) possible. However, this approach often neglects implicit writing context and user intent, lacks support for user control and autonomy, and provides limited assistance for sensemaking and revising writing plans. To address these challenges, we introduce VISAR, an AI-enabled writing assistant system designed to help writers brainstorm and revise hierarchical goals within their writing context, organize argument structures through synchronized text editing and visual programming, and enhance persuasiveness with argumentation spark recommendations. VISAR allows users to explore, experiment with, and validate their writing plans using automatic draft prototyping. A controlled lab study confirmed the usability and effectiveness of VISAR in facilitating the argumentative writing planning process.


Visual Programming: Compositional visual reasoning without training

Gupta, Tanmay, Kembhavi, Aniruddha

arXiv.org Artificial Intelligence

We present VISPROG, a neuro-symbolic approach to solving complex and compositional visual tasks given natural language instructions. VISPROG avoids the need for any task-specific training. Instead, it uses the in-context learning ability of large language models to generate Python-like modular programs, which are then executed to get both the solution and a comprehensive and interpretable rationale. Each line of the generated program may invoke one of several off-the-shelf computer vision models, image processing routines, or Python functions to produce intermediate outputs that may be consumed by subsequent parts of the program. We demonstrate the flexibility of VISPROG on four diverse tasks: compositional visual question answering, zero-shot reasoning on image pairs, factual knowledge object tagging, and language-guided image editing. We believe neuro-symbolic approaches like VISPROG are an exciting avenue to easily and effectively expand the scope of AI systems to serve the long tail of complex tasks that people may wish to perform.
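The toy interpreter below conveys the execution model: each program line invokes a named module and binds its output for later lines, so every intermediate result is kept as an inspectable trace. The two-module registry and the LINE=MODULE(args) syntax are assumptions made for this sketch, not VISPROG's actual module set.

```python
# Toy interpreter in the spirit of VISPROG: each line invokes a named module
# and binds its output for later lines, yielding an inspectable trace.
# The module registry and line syntax are assumptions for this sketch.

MODULES = {
    "LOC":   lambda env, obj: [(10, 10, 50, 50)],      # stub object detector
    "COUNT": lambda env, boxes: len(env[boxes]),        # count a prior output
}

def run(program: str) -> dict:
    env = {}
    for line in program.strip().splitlines():
        target, call = line.split("=", 1)
        name, args = call.strip().rstrip(")").split("(", 1)
        argv = [a.strip() for a in args.split(",") if a.strip()]
        env[target.strip()] = MODULES[name](env, *argv)
    return env  # every intermediate is kept: the interpretable rationale

trace = run("""
BOXES=LOC(dogs)
ANSWER=COUNT(BOXES)
""")
print(trace["ANSWER"])  # 1
```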


Machine Learning with Visual Programming

#artificialintelligence

Machine learning (ML) is a part of artificial intelligence (AI) that teaches a computer to work and make decisions based on historical data. An ML algorithm learns from historical data to generate a predictive model used to forecast future outcomes. Advanced ML models can be applied in AI applications such as recommender systems, text processing, and image recognition. To work with ML, a data scientist should have a good knowledge of mathematics and statistics, and the ability to process data and interpret the results. To process the data, you have to use specific tools or be able to program.


Democratized image analytics by visual programming through integration of deep models and small-scale machine learning

#artificialintelligence

Deep learning [1] has revolutionized the field of biomedical image analysis. Conventional approaches have used problem-specific algorithms to describe images with manually crafted features, such as cell morphology, count, intensity, and texture. Feature learning with deep convolutional neural networks is implicit, and training the network usually focuses on particular tasks, such as breast cancer detection in mammography [2], subcellular protein localization [3], or plant disease detection [4]. Training a deep network usually requires a large number of images, which limits its utility. For example, the classifier for plant disease detection by Mohanty et al. [4] was trained on 54,306 images of diseased and healthy plants, and the yeast protein localization model by Kraus et al. [3] was inferred from 22,000 annotated images, but not everyone who could benefit from image analysis has so many well-annotated images.